Applying Rule-Based Normalization to Different Types of Historical Texts - An Evaluation
Authors
Abstract
This paper deals with normalization of language data from Early New High German. We describe an unsupervised, rule-based approach which maps historical wordforms to modern wordforms. Rules are specified in the form of context-aware rewrite rules that apply to sequences of characters. They are derived from two aligned versions of the Luther bible and weighted according to their frequency. Applying the normalization rules to texts by Luther results in 91% exact matches, clearly outperforming the baseline (65%). Matches can be improved to 93% by combining the approach with a word substitution list. If applied to more diverse language data from roughly the same period, performance drops to 42% exact matches (baseline: 32%), but still exceeds the wordlist-only approach. The results show that rules derived from a very different type of text can support normalization to a certain extent.
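The abstract gives only a high-level description of the method. As a rough illustration, the following Python sketch shows how weighted, context-aware character rewrite rules of the kind described might be applied to a historical wordform, with an optional full-word substitution list checked first. The rule format, the example rules, the greedy highest-weight matching strategy, and all names (RewriteRule, normalize, SUBSTITUTION_LIST) are illustrative assumptions, not the authors' implementation.

```python
# Minimal sketch (not the authors' implementation): applying weighted,
# context-aware character rewrite rules to a historical wordform.
# Rule format, contexts, and weights below are illustrative assumptions.

from dataclasses import dataclass

@dataclass
class RewriteRule:
    source: str    # character sequence in the historical form
    target: str    # replacement in the modern form
    left: str      # required left context ("" = any, "#" = word boundary)
    right: str     # required right context ("" = any, "#" = word boundary)
    weight: float  # e.g. relative frequency in the aligned bible versions

# A few invented example rules for Early New High German spellings.
RULES = [
    RewriteRule("v", "u", "#", "", 0.8),     # "vnd"  -> "und" at word start
    RewriteRule("th", "t", "", "", 0.6),     # "thun" -> "tun"
    RewriteRule("eyn", "ein", "", "", 0.9),  # "eyn"  -> "ein"
]

def normalize(word: str) -> str:
    """Greedily apply the highest-weighted matching rule at each position."""
    padded = "#" + word + "#"  # word-boundary markers
    result = []
    i = 1
    while i < len(padded) - 1:
        best = None
        for rule in sorted(RULES, key=lambda r: -r.weight):
            if not padded.startswith(rule.source, i):
                continue
            if rule.left and padded[i - 1] != rule.left:
                continue
            end = i + len(rule.source)
            if rule.right and padded[end] != rule.right:
                continue
            best = rule
            break
        if best:
            result.append(best.target)
            i += len(best.source)
        else:
            result.append(padded[i])
            i += 1
    return "".join(result)

# Optional full-word substitution list, checked before the character rules;
# the paper reports that combining the two improves exact matches.
SUBSTITUTION_LIST = {"vnnd": "und"}

def normalize_with_list(word: str) -> str:
    return SUBSTITUTION_LIST.get(word) or normalize(word)

print(normalize("vnd"))    # -> "und"
print(normalize("thun"))   # -> "tun"
```

A real system would derive the rule inventory and weights automatically from character-aligned historical/modern word pairs rather than listing them by hand; the sketch only illustrates the application step.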
Similar Papers
Normalizing Medieval German Texts: from rules to deep learning
The application of NLP tools to historical texts is complicated by a high level of spelling variation. Different methods of historical text normalization have been proposed. In this comparative evaluation I test the following three approaches to text canonicalization on historical German texts from the 15th–16th centuries: rule-based, statistical machine translation, and neural machine translation....
Rule-Based Normalization of Historical Texts
This paper deals with normalization of language data from Early New High German. We describe an unsupervised, rule-based approach which maps historical wordforms to modern wordforms. Rules are specified in the form of context-aware rewrite rules that apply to sequences of characters. They are derived from two aligned versions of the Luther bible and weighted according to their frequency. The eva...
POS Tagging for Historical Texts with Sparse Training Data
This paper presents a method for part-of-speech tagging of historical data and evaluates it on texts from different corpora of historical German (15th–18th century). Spelling normalization is used to preprocess the texts before applying a POS tagger trained on modern German corpora. Using only 250 manually normalized tokens as training data, the tagging accuracy of a manuscript from the 15th cen...
A Study on the Commentary of Historical Verses with an Emphasis on the Rule of Al-Ibrah
One of the prevalent rules for the commentary of historical verses that have a certain revelation occasion and refer to a specific time and place is the rule of al-ibrah, stated as: take into consideration the universality of the word, not the particularity of the occasion. The source of this rule refers to the verses which have a universal word and a particular occasion. The referent of the...
Normalization of qPCR array data: a novel method based on procrustes superimposition
MicroRNAs (miRNAs) are short, endogenous non-coding RNAs that function as guide molecules to regulate transcription of their target messenger RNAs. Several methods, including low-density qPCR arrays, are increasingly being used to profile the expression of these molecules in a variety of different biological conditions. Reliable analysis of expression profiles demands removal of technical variati...